System and method for content-based partitioning and mining

ABSTRACT

Methods and systems are provided for partitioning data of a database or data store into several independent parts as part of a data mining process. The methods and systems use a mining application having content-based partitioning logic to partition the data. Once the data is partitioned, the partitioned data may be grouped and distributed to an associated processor for further processing. The mining application and content-based partitioning logic may be used in a computing system, including shared memory and distributed memory multi-processor computing systems. Other embodiments are described and claimed.

BACKGROUND

Given modern computing capabilities, it is relatively easy to collect and store vast amounts of data, such as facts, numbers, text, etc. The issue then becomes how to analyze the vast amount of data to distinguish important data from less important data. The process of filtering the data to determine important data is often referred to as data mining. Data mining refers to a process of collecting data and analyzing the collected data from various perspectives, and summarizing any relevant findings. Locating frequent itemsets in a transaction database has become an important consideration when mining data. For example, frequent itemset mining has been used to locate useful patterns in a customer's transaction database.

Frequent Itemset Mining (FIM) is the basis of Association Rule Mining (ARM), and has been widely applied in marketing data analysis, protein sequences, web logs, text, music, stock market, etc. One popular algorithm for frequent itemset mining is the frequent pattern growth (FP-growth) algorithm. The FP-growth algorithm is used for mining frequent itemsets in a transaction database. The FP-growth algorithm uses a prefix tree (termed the “FP-tree”) representation of the transaction database, and is faster than other mining algorithms, such as the Apriori mining algorithm. The FP-growth algorithm is often described as a recursive elimination scheme.

As part of a preprocessing step, the FP-growth algorithm deletes all items from the transactions that are not individually frequent according to a defined threshold. That is, the FP-growth algorithm deletes all items that do not appear in a user-specified minimum number of transactions. After preprocessing, an FP-tree is built, then the FP-growth algorithm constructs a “conditional pattern base” for each frequent item to construct a conditional FP-tree. The FP-growth algorithm then recursively mines the conditional FP-tree. The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from the conditional FP-tree.

Since the FP-growth algorithm has been recognized as a powerful tool for frequent itemset mining, there has been a large amount of research in efforts to implement the FP-growth algorithm in parallel processing computers. There have been two main approaches to implement FP-growth: the multiple tree approach and the single tree approach. The multiple tree approach builds multiple FP-trees separately, which results in the introduction of many redundant nodes. FIG. 6 illustrates the multiple nodes generated by the conventional multiple tree approach with 1, 4, 8, 16, 32 and 64 threads (trees). The example database used to generate FIG. 6 is a benchmark dataset “accidents”, which can be found at the link “http://fimi.cs.helsinki.fi/data/” (the minimal support threshold is 200,000). As shown, the multiple tree approach will generate two (2) times as many tree nodes on four (4) threads, and about nine (9) times as many tree nodes on sixty-four (64) threads, as compared to only one thread. The shortcoming of building redundant nodes in multiple trees results in great memory demand, and sometimes the memory is not large enough to contain the multiple trees. The previous single tree approach builds only a single FP-tree in memory, but it needs to generate one lock that is associated with each of the tree nodes, thereby limiting scalability.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computing device including a mining application having content-based partitioning functionality, according to an embodiment.

FIGS. 2A-2B depict a flow diagram which illustrates various logical operations by content-based partition logic, according to an embodiment.

FIG. 3 represents a probe tree data structure, under an embodiment.

FIGS. 4A-4D illustrate the first steps in a parallel build of a frequent pattern (FP) tree, according to an embodiment.

FIG. 5 depicts partitioned sub-branches and a grouping result for a probe tree, according to an embodiment.

FIG. 6 illustrates the multiple nodes generated by a conventional multiple tree approach with 1, 4, 8, 16, 32 and 64 threads (trees).

DETAILED DESCRIPTION

Embodiments of methods and systems provide for partitioning data of a database or data store into several independent parts as part of a data mining process. The methods and systems use content-based partitioning logic to partition the data by building a probe structure. In an embodiment, the probe structure includes a probe tree and associated probe table. Once the data is partitioned, according to content for example, the resulting parts may be grouped and distributed to an associated processor for further processing. The content-based partitioning logic may be used in a computing system, including shared memory and distributed memory multi-processor computing systems, but is not so limited.

As described above, after the data is partitioned into independent parts or branches (e.g. disjoint sub-branches), each branch or group of branches may then be assigned to a processor (or thread) for further processing. In an embodiment, after scheduling by a master thread, each transaction is distributed to a corresponding processor to finish building a specific branch of the FP-tree. According to the FP-growth algorithm, the data dependency only exists within each sub-branch, and there is no dependency between the sub-branches. Thus many of the locks required in conventional single tree methodologies are unnecessary. Under an embodiment, a database is partitioned based on the content of transactions in a transaction database. The content-based partitioning saves memory, and eliminates many locks when building the FP-tree. The memory savings are based on a number of factors, including the characteristics of the database, the support threshold, and the number of processors. The techniques described herein provide an efficient and valuable tool for data mining using multi-core and other processing architectures.

In an embodiment, a master/slave processing architecture is used to efficiently allocate processing operations to a number of processors (also referred to as “threads”). The master/slave architecture includes a master processor (master thread), and any remaining processors (threads) are designated as slave processors (slave threads) when building the FP-tree. One of the tasks of the master thread is to load a transaction from a database, prune the transaction, and distribute the pruned transactions among the slave threads for a parallel build of the FP-tree. Each slave thread has its own transaction queue and obtains a pruned transaction from the queue each time the master thread assigns a pruned transaction to an associated slave thread. Each slave thread processes its pruned transaction to build the FP-tree. This architecture allows the master thread to take some preliminary measures of the transaction before delivering it to a slave thread. According to the results of the preliminary measures, the master thread may also decide how to distribute the transaction to a particular thread. In alternative embodiments, all threads operate independently to build the FP-tree using the probe structure.

In the following description, numerous specific details are introduced to provide a thorough understanding of, and enabling description for, embodiments of the systems and methods. One skilled in the relevant art, however, will recognize that these embodiments may be practiced without one or more of the specific details, or with other components, systems, etc. In other instances, well-known structures or operations are not shown, or are not described in detail, to avoid obscuring aspects of the disclosed embodiments.

FIG. 1 illustrates a computing device 100 including a mining application 101 having content-based partition logic 102 which interacts with a data store (or database) 104 when performing data mining operations, according to an embodiment. The mining application 101 and content-based partition logic 102 are described below in detail. The computing device 100 includes any computing system, such as a handheld, mobile computing system, a desktop computing system, laptop computing system, and other computing systems. The computing device 100 shown in FIG. 1 is referred to as a multi-processor or multi-core device since the architecture includes multiple processors 106 a-106 x. Tasks may be efficiently distributed among the processors 106 a-106 x so as not to overwhelm an individual processor. In other embodiments, the computing device 100 may include a single processor and other components. The computing device 100 typically includes system memory 108.

Depending on the configuration and type of computing device, system memory 108 may be volatile (such as random-access memory (RAM) or other dynamic storage), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination.

The system memory 108 may include an operating system 110 and one or more applications/modules 112. Computing device 100 may include additional computer storage 114, such as magnetic storage devices, optical storage devices, etc. Computer storage includes, but is not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk ROM (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which may be used to store information. Computing device 100 may also include one or more input devices 116 and one or more output devices 118. Computing device 100 may also contain communication connections 120 that allow the computing device 100 to communicate with other computing devices 122, processors, and/or systems, such as over a wired and/or wireless network, or other network.

FIGS. 2A-2B depict a flow diagram which illustrates various logical operations of the mining application 101 including the content-based partition logic 102, according to an embodiment. For this embodiment, processor 106 a (referred to herein as the master processor) executes the mining application 101 including the content-based partition logic 102, as described below. The transaction database of Table 1 below is used in conjunction with the flowchart to provide an illustrative example using the mining application 101 and content-based partition logic 102. For example, each transaction in Table 1 may represent a sequence of items, such as items purchased as part of a web-based transaction, wherein each item of the transaction is represented by a unique letter.

TABLE 1

Transaction Number    Transaction
1                     A, B, C, D, E
2                     F, B, D, E, G
3                     A, B, F, G, D
4                     B, D, A, E, G
5                     B, F, D, G, K
6                     A, B, F, G, D
7                     A, R, M, K, O
8                     B, F, G, A, D
9                     A, B, F, M, O
10                    A, B, D, E, G
11                    B, C, D, E, F
12                    A, B, D, E, G
13                    A, B, F, G, D
14                    B, F, D, G, R
15                    A, B, D, F, G
16                    A, R, M, K, J
17                    B, F, G, A, D
18                    A, B, F, M, O

At 200, the mining application 101 uses the content-based partitioning logic 102 and begins by scanning the first half of a transaction database, such as the transaction database in Table 1, to determine frequent items based on a defined support threshold. Different support thresholds may be implemented depending on a particular application. For this example, the support threshold is 12. At 202, each transaction of the first half of the transaction database is scanned and the number of times that each item occurs in the scan is counted, under an embodiment. Since the search covers half of the transaction database, a support threshold of six (6) is used to create the header table for the FP-tree based on the scan of the first half of the database. At 204, it is determined if the first half of the database has been scanned. If not, the flow returns to 200, and the next transaction is read. If the first half of the database has been scanned at 204, then the logic 102, at 206, creates a header table for the frequent items that meet the minimum support threshold based on the scan of the first half of the transaction database.

The logic 102 also assigns an ordering to the frequent items identified in the first scan. In an embodiment, the logic 102 orders the frequent items according to an occurrence frequency in the transaction database. For this example, the logic 102 orders the most frequent items as B then A then D then F then G (B≧A≧D≧F≧G). This ordering is used to create the header table and probe structure, described below. For this example, scanning the first half of the transaction database has resulted in the identification of various items which are inserted into the header table (Table 2). As shown below, after the initial scan of the first half of the transaction database, the header table includes the identified frequent items: item B occurring eight (8) times, item A occurring seven (7) times, item D occurring seven (7) times, item F occurring six (6) times, and item G occurring six (6) times.

TABLE 2

Item    Count
B       8
A       7
D       7
F       6
G       6
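The first-half scan can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_header_table` is hypothetical, transactions are written as strings of single-letter items, and ties in the counts are broken alphabetically (the description does not say how ties are ordered).

```python
from collections import Counter

# Transactions 1-9 of Table 1 (the first half of the example database).
FIRST_HALF = [
    "ABCDE", "FBDEG", "ABFGD", "BDAEG", "BFDGK",
    "ABFGD", "ARMKO", "BFGAD", "ABFMO",
]


def build_header_table(transactions, min_support):
    """Count how many transactions contain each item; keep the frequent ones."""
    counts = Counter(item for txn in transactions for item in set(txn))
    frequent = [(item, n) for item, n in counts.items() if n >= min_support]
    # Assumption: most frequent first, ties broken alphabetically.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


# Half of the database is scanned, so half of the support threshold (12 / 2) is used.
header = build_header_table(FIRST_HALF, min_support=6)
print(header)  # [('B', 8), ('A', 7), ('D', 7), ('F', 6), ('G', 6)]
```

With the alphabetical tie-break, the result matches the ordering B, A, D, F, G and the counts of Table 2.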

At 208, the logic 102 identifies frequent items which are used to build a probe structure. In an embodiment, the probe structure includes a probe tree and associated probe table, described below. For this example, items B, A, D, and F are identified as items to be used in building a probe tree. At 210, logic 102 creates a root node of the probe tree and begins building an empty probe table. At 212, the logic 102 begins building the probe tree from the root node using the first most frequent item in the transaction database from the identified frequent items B, A, D, and F. At 214, the logic 102 adds a child with its item identification (ID) to the probe tree. While building the probe tree, the logic 102 begins filling in an empty probe table (e.g. counts are set to zero, as shown below). As described above, the probe tree is built by evaluating the identified frequent items. At 216, the logic 102 determines if there are any remaining frequent items to evaluate. If there are remaining items, the flow returns to 212 and continues as described above. If there are no remaining items, the flow proceeds to 218.

FIG. 3 represents a probe tree data structure (or probe tree) 300 which includes the identified items resulting from steps 208-216. A more detailed description of the probe tree and probe table build follows. According to an embodiment, the logic 102 selects the first “M” identified frequent items to build the probe tree and associated probe table. As a result, the probe tree and associated probe table include 2^M branches. In experiments, M=10 is a sound choice for many cases. As described below, the 2^M branches may be further grouped to obtain substantially equal groups or chunks for further processing.

For this example, the probe tree 300 is a representation of the identified frequent items (e.g. B, A, D, F (M=4)) and their transactional relationship to one another in accordance with the occurrence frequency and content. For the first most frequent item, B, the logic 102 creates node 302 which corresponds to the most frequent item B. The logic 102 takes the next most frequent item, A, and creates node 304 corresponding to frequent item A. The logic 102 also creates block 306 which shows that A is a “child” of B. Since A is only a child of B, the logic 102 takes the next most frequent item, D, creating node 308 corresponding to the next most frequent item. The logic 102 also creates blocks 310, 312, and 314, based on the determination that D is a child of B and A for each instance of B and A in the probe tree 300.

For the least frequent of the selected items in this example, F, the logic 102 creates node 316 which corresponds to the frequent item, F. The logic 102 also creates blocks 318, 320, 322, 324, 326, 328, and 330, illustrating that F is a child of B, A, and D for each instance of B, A, and D in the probe tree 300. The corresponding probe table (Table 3) is shown below for the identified frequent items B, A, D, and F. At this point, logic 102 sets the count for each item to zero, but is not so limited. Each row of the probe table (Table 3) represents a content-based transaction including one or more of the identified frequent items B, A, D, and F. The number “1” is used to represent an occurrence of a specific item in a transaction including the one or more frequent items, whereas the number “0” is used to represent one or more items that do not occur in an associated transaction.

TABLE 3

B    A    D    F    Count
1    1    1    1    0
1    1    1    0    0
1    1    0    1    0
1    1    0    0    0
1    0    1    1    0
1    0    1    0    0
. . .
0    1    0    0    0
. . .
0    0    0    0    0
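As a rough illustration of the empty probe table, the following Python sketch enumerates all 2^M presence/absence patterns for the M selected items and sets each count to zero. The function name `empty_probe_table` and the tuple-of-bits row representation are assumptions made for this sketch, not part of the described embodiment.

```python
from itertools import product


def empty_probe_table(probe_items):
    """Map every 0/1 pattern over the probe items to a count of zero."""
    m = len(probe_items)
    # product((1, 0), ...) yields rows from all ones down to all zeros,
    # which is the row order used in Table 3.
    return {bits: 0 for bits in product((1, 0), repeat=m)}


table = empty_probe_table(["B", "A", "D", "F"])
rows = list(table)
print(len(rows))   # 16 rows, i.e. 2^M with M = 4
print(rows[:3])    # [(1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 1)]
```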

With continuing reference to FIGS. 2A-2B, after creating the probe tree and empty probe table, the flow proceeds to 218, where the logic 102 continues the first scan by evaluating the next transaction of the transaction database. For this embodiment, the logic 102 scans the second half of the transaction database (transactions 10-18), evaluating each respective transaction. At 220, based on the scanned transactions, the logic 102 provides counts in the probe table and updates the header table for the frequent items that meet the minimum support threshold. The updated probe table (Table 5) is shown below. At 222, if there are no further transactions in the transaction database, the flow proceeds to 224. Otherwise, if there are further transactions to process, the flow returns to 218. Based on the ordering of the identified frequent items above (B≧A≧D≧F≧G), the logic 102 updates the header table (Table 4 below).

TABLE 4

Item    Count
B       16
A       14
D       14
F       12
G       12

The updated probe table is shown below (Table 5), including the count field resulting from the scan of the second half of the transaction database. Each count represents the number of transactions which contain the respective identified frequent items after scanning the second half of the transaction database. Table 5 represents the content-based transactions after the first scan of the transaction database.

TABLE 5

B    A    D    F    Count
1    1    1    1    3
1    1    1    0    2
1    1    0    1    1
1    1    0    0    0
1    0    1    1    2
1    0    1    0    0
. . .
0    1    0    0    1
. . .
0    0    0    0    0
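The counts in the probe table can be reproduced with a short Python sketch. It assumes each row is keyed by a bit string over the probe items B, A, D, F; the helper name `probe_key` is hypothetical.

```python
from collections import Counter

PROBE_ITEMS = ["B", "A", "D", "F"]

# Transactions 10-18 of Table 1 (the second half of the example database).
SECOND_HALF = [
    "ABDEG", "BCDEF", "ABDEG", "ABFGD", "BFDGR",
    "ABDFG", "ARMKJ", "BFGAD", "ABFMO",
]


def probe_key(transaction, probe_items=PROBE_ITEMS):
    """Return '1'/'0' per probe item: present or absent in the transaction."""
    return "".join("1" if item in transaction else "0" for item in probe_items)


probe_counts = Counter(probe_key(txn) for txn in SECOND_HALF)
for key in ("1111", "1110", "1101", "1011", "0100"):
    print(key, probe_counts[key])
```

Rows that never occur, such as 1100, simply keep a count of zero.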

At 224, each transaction or probe tree branch, including one or more identified frequent items, may be assigned to one or more processors for further conditioning. The probe tree and probe table may be described as a preprocessing of the transaction data which operates to partition the transaction data for data mining. According to an embodiment, the probe table may be used to balance the processing load by identifying a transaction that consists of one or more items, wherein each transaction includes a unique itemset. More particularly, as described below, the logic 102 attempts to balance the workload, determined by the content of a particular transaction and the associated transaction count as identified by using the probe table and probe tree.

Based on a review of the probe table, there are three (3) transactions that include the identified frequent items B, A, D, and F. There are two (2) transactions that include the identified frequent items B, A, and D, but not F. There is one transaction that includes the identified frequent items B, A, and F, but not D. There are no transactions that include only the identified frequent items B and A. There are two (2) transactions that include the identified frequent items B, D, and F, but not A. There is one transaction that includes the identified frequent item A. The remaining transactions of the probe table may be determined in a similar manner. As described below, there are fewer locks associated with the nodes of the probe tree. Note that in the conventional single tree approach, each FP-tree node requires a lock. Since the probe tree does not require as many locks, a system using the probe tree functionality has increased scalability.

The updated probe table may be used to construct the frequent-pattern tree (FP-tree) by assigning transaction chunks to respective processors (such as 106 a-106 x), so that the individual processors each build a respective part of the FP-tree (a parallel build process). A grouping algorithm, described below, may be used to group branches of the probe tree into N groups, where N is the number of available processors or threads. A group of branches may then be assigned to a respective processor. The probe tree and probe table provide an important tool for balancing the computational load between multiple processors when mining data, including building of the FP-tree.

Under an embodiment, the mining application 101 assigns a group of transactions to a particular thread, so that each thread may process approximately the same number of transactions. The assignment of groups to the processors results in minimal conflicts between these processing threads. As described below, a heuristic algorithm may be used to group the branches of the probe table (e.g. represented by the counts of each row) into a number of approximately equal groups which correspond to the number of processors used to build the FP-tree and perform the data mining operations.

For example, assume that the computing device 100 includes four processors 106 a-106 d. According to an embodiment, one processor, such as processor 106 a, is assigned as the master processor, and remaining processors, such as 106 b-106 d, are assigned as slave processors. The master processor 106 a reads each transaction and assigns each transaction to a particular slave processor to build the FP-tree. At 226, a root node is created for the FP-tree. At 228, the master processor 106 a, using the mining application 101, scans the transaction database once again, starting with the first transaction.

At 230, the master processor 106 a distributes each transaction to a specific slave processor. At 232, each slave processor reads each transaction sent by the master processor 106 a. At 234, each slave processor inserts its pruned item set into the full FP-tree. At 236, the mining application 101 determines whether each transaction of the full database has been read and distributed. If each transaction of the full database has been read and distributed, the flow proceeds to 240 and the mining application 101 begins mining the FP-tree for frequent patterns. If each transaction of the full database has not been read and distributed, the flow returns to 228. If there are no more transactions to process, the flow continues to 240; otherwise the flow returns to 232.

The distribution and parallel build of the FP-tree is shown in FIGS. 4A-4D. As described above, a heuristic algorithm may be used to group the transactions of the probe tree into substantially equal groupings for processing by the four processors (the three slave processors build respective parts of the FP-tree). For this example, the mining application 101 uses the heuristic algorithm to group the branches of the probe tree into three groups since there are three slave processors. According to an embodiment, the heuristic algorithm assumes that Vi is a vector which contains all chunks that belong to group i. Ci is the number of transactions of group i (the size of Vi). With T processors in total (one master and T−1 slaves), the goal is to distribute the transactions into (T−1) groups, making the total number of transactions in every group approximately equivalent.

The algorithm may be formalized as follows:

$$V = \{V_1, V_2, \ldots, V_{T-1}\}$$

$$f = \sum_{i=1}^{T-1} \left( C_i - \frac{\sum_{j=1}^{T-1} C_j}{T-1} \right)^{2}$$

The goal is to find V to minimize f. The input is the set of chunks P={P1, P2, . . . , Pm}. The output is V={V1, V2, . . . , V(T−1)}.

That is, for each Pi:

Find j such that Cj=min (Ck, k=1, 2, . . . , T−1)

Vj=Vj ∪ {chunk i}

Cj=Cj+sizeof (chunk i)

Return V.
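The steps above amount to a greedy assignment: each chunk goes to whichever group currently holds the fewest transactions. A minimal Python sketch follows, assuming chunks arrive as (name, size) pairs in probe-table order and that ties go to the lowest-numbered group; the names `group_chunks` and `imbalance` are illustrative only.

```python
def group_chunks(chunks, n_groups):
    """Greedily place each (name, size) chunk into the currently smallest group."""
    groups = [[] for _ in range(n_groups)]   # V_1 .. V_(T-1)
    sizes = [0] * n_groups                   # C_1 .. C_(T-1)
    for name, size in chunks:
        j = sizes.index(min(sizes))          # first group with the minimum C_j
        groups[j].append(name)
        sizes[j] += size
    return groups, sizes


def imbalance(sizes):
    """The objective f: squared deviation of each group size from the mean."""
    mean = sum(sizes) / len(sizes)
    return sum((c - mean) ** 2 for c in sizes)


# Non-empty rows of the updated probe table (Table 5), three slave processors.
chunks = [("1111", 3), ("1110", 2), ("1101", 1), ("1011", 2), ("0100", 1)]
groups, sizes = group_chunks(chunks, 3)
print(sizes, imbalance(sizes))  # [3, 3, 3] 0.0
```

Chunk order and tie-breaking determine which chunks end up together, so the membership of each group produced by this sketch can differ from a grouping reached another way even when the group sizes agree.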

Consequently, after using the heuristic algorithm, the first group includes the three (3) transactions that contain the items B, A, D, and F. The second group includes the two (2) transactions with items B, A and D, but without item F, and the one (1) transaction with items B, A, and F, but without item D (total of three (3) transactions). The third group includes the two (2) transactions with items B, D and F, but without item A, and the one (1) transaction with item A, but without items B, D, and F (total of three (3) transactions). There are no counts for other branches in the probe tree (see the probe table), so the logic 102 only assigns groups of the above five (5) branches based on the results of the heuristic algorithm. Based on the description above, the master processor 106 a assigns the first group to the first slave processor 106 b, the second group to the second slave processor 106 c, and the third group to the third slave processor 106 d for a parallel build of the FP-tree.
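The routing step can be sketched with one queue per slave thread. This is a simplified illustration, not the claimed implementation: the names `GROUP_OF`, `prune`, `probe_key`, `slave`, and `run` are hypothetical, the chunk-to-slave mapping is copied from the grouping just described, and each slave merely records what it receives instead of building its part of the FP-tree.

```python
import queue
import threading

HEADER_ORDER = ["B", "A", "D", "F", "G"]   # header-table order from the example
PROBE_ITEMS = HEADER_ORDER[:4]             # B, A, D, F
# Assumed chunk-to-slave mapping (index 0 = first slave, 106 b).
GROUP_OF = {"1111": 0, "1110": 1, "1101": 1, "1011": 2, "0100": 2}

# Table 1, one string per transaction.
DATABASE = [
    "ABCDE", "FBDEG", "ABFGD", "BDAEG", "BFDGK", "ABFGD",
    "ARMKO", "BFGAD", "ABFMO", "ABDEG", "BCDEF", "ABDEG",
    "ABFGD", "BFDGR", "ABDFG", "ARMKJ", "BFGAD", "ABFMO",
]


def prune(txn):
    """Keep only frequent items, in header-table order."""
    return [item for item in HEADER_ORDER if item in txn]


def probe_key(items):
    return "".join("1" if item in items else "0" for item in PROBE_ITEMS)


def slave(inbox, received):
    """Drain this slave's queue until the master sends the None sentinel."""
    while True:
        items = inbox.get()
        if items is None:
            break
        received.append(items)  # stand-in for inserting into the FP-tree


def run(database, n_slaves=3):
    inboxes = [queue.Queue() for _ in range(n_slaves)]
    received = [[] for _ in range(n_slaves)]
    workers = [threading.Thread(target=slave, args=(inboxes[i], received[i]))
               for i in range(n_slaves)]
    for worker in workers:
        worker.start()
    for txn in database:                   # the master's second scan
        items = prune(txn)
        inboxes[GROUP_OF[probe_key(items)]].put(items)
    for inbox in inboxes:
        inbox.put(None)
    for worker in workers:
        worker.join()
    return received


received = run(DATABASE)
print([len(r) for r in received])  # [6, 6, 6]
```

Because every pruned transaction with the same probe key goes to the same queue, each slave only ever touches its own sub-branches below the probe items.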

FIGS. 4A-4D illustrate the first steps in a parallel build of the FP-tree by the three slave processors 106 b-106 d. After the header table and root node of the FP-tree are built as described above, the master processor 106 a reads the first transaction (A,B,C,D,E), pruning it to (B,A,D) according to the header table content. The master processor 106 a, for this example, assigns (B,A,D) to slave processor 106 c according to the grouping based on the heuristic algorithm, probe tree, and probe table described above. As shown in FIG. 4A, after receiving the pruned items (B,A,D), slave processor 106 c sequentially creates nodes B, A, and D. The slave processor 106 c sets the corresponding node links (shown as dotted lines) for slave processor 106 c and also sets each node count to one (1).

The master processor 106 a reads the second transaction (F,B,D,E,G), pruning it to (B,D,F,G) according to the header table content. The master processor 106 a then assigns (B,D,F,G) to slave processor 106 d. Thereafter, as shown in FIG. 4B, slave processor 106 d inserts the pruned transaction (B,D,F,G) into the FP-tree by sequentially creating nodes D, F and G. The slave processor 106 d sets the corresponding node links, sets the node count equal to one (1) for D, F, and G, while incrementing the node count for B by one (1). The master processor 106 a reads the third transaction (A,B,F,G,D), pruning it to (B,A,D,F,G) according to the header table content. The master processor 106 a then assigns the pruned transaction (B,A,D,F,G) to slave processor 106 b. Thereafter, slave processor 106 b inserts the pruned transaction (B,A,D,F,G) into the FP-tree by sequentially creating nodes F and G and setting the corresponding node links as shown in FIG. 4C. The slave processor 106 b sets the node count equal to one (1) for F and G, while incrementing the node count by one (1) for nodes B, A, and D in the common path.

The master processor 106 a reads the fourth transaction (B,D,A,E,G), pruning it to (B,A,D,G) according to the header table content. The master processor 106 a then assigns the pruned transaction (B,A,D,G) to slave processor 106 c. Thereafter, slave processor 106 c inserts the pruned transaction (B,A,D,G) into the FP-tree by creating node G and setting the corresponding node links as shown in FIG. 4D. The slave processor 106 c sets the node count equal to one (1) for node G, while incrementing the node count by one (1) for B, A, and D in the common path. The master processor 106 a continues scanning the transaction database, assigning each transaction to respective slave processors as described above. Each slave processor inserts the pruned transactions into the FP-tree until the full FP-tree is built including the representative links and node counts, as described above. In an embodiment, when creating the FP-tree, the multiple processors may not access a node with a lock while processing different transactions. Accordingly, the locks prevent multiple threads from attempting to create a duplicate node in the FP-tree. In an embodiment, the FP-tree is built from and saved to system memory 108. In another embodiment, the FP-tree is built from and saved to a distributed memory system.
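The node counts produced by the first four insertions can be checked with a small single-threaded Python sketch. It is a simplification under stated assumptions: the `Node` class and the helper names are hypothetical, and the node links and locks of the described parallel build are omitted.

```python
class Node:
    """One FP-tree node: an item, a count, and children keyed by item."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


HEADER_ORDER = ["B", "A", "D", "F", "G"]


def prune(txn):
    """Keep only frequent items, in header-table order."""
    return [item for item in HEADER_ORDER if item in txn]


def insert(root, items):
    """Walk down from the root, creating missing nodes and incrementing counts."""
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = Node(item, node)
            node.children[item] = child
        child.count += 1
        node = child


def path_count(root, path):
    """Return the count stored at the node reached by following path."""
    node = root
    for item in path:
        node = node.children[item]
    return node.count


root = Node(None)
for txn in ["ABCDE", "FBDEG", "ABFGD", "BDAEG"]:   # transactions 1-4 of Table 1
    insert(root, prune(txn))

print(path_count(root, "B"), path_count(root, "BA"), path_count(root, "BAD"))  # 4 3 3
```

After these four transactions, B has been visited four times, the B-A-D path three times, and each of the remaining nodes once, consistent with the walkthrough above.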

After the FP-tree is built, the FP-tree may be mined for specific patterns with respect to the identified frequent items. The FP-tree may be mined using various data mining algorithms. For example, conditional (prefix) based mining operations may be performed using the FP-tree. The mining of the FP-tree according to an embodiment may be summarized as follows. Each processor may be used to independently mine a frequent item of the FP-tree. For each frequent item, the mining process constructs a “conditional pattern base” (e.g. a “sub-database” which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern).

After constructing the conditional FP-tree, the conditional FP-tree is recursively mined according to the identified frequent items. As described above, content-based partitioning logic 102 is used to partition data of a database or data store into several independent parts for subsequent distribution to the one or more processors. The content-based partitioning logic 102 is used to partition the database equally so that the load is balanced when building and mining the FP-tree (e.g. each processor gets an approximately equal number of transactions to process), but is not so limited.

Another example illustrates the use of the mining application 101 including the content-based partitioning logic 102. Assume that a database contains eight (8) items (A, B, C, D, E, F, G, H). For this example, the database includes a total of eighty (80) transactions. After examining a first part of the database, such as the first half for example, it is determined that items A, B, and C are the first three (3) most frequent items in the first part of the database. Next, the transactions may be partitioned according to the content of the first M most frequent items. As described above, M may be heuristically determined, based in part on the number of available processors. For the present example, M is set equal to three (M=3), since there are 3 most frequent items. Thus, the transactions may be partitioned into 2^M or eight (8) content-based chunks according to items A, B, and C. A content-based chunk is equal to a sub-branch in an FP-tree. Table 6 below provides an example of the distribution of transactions (e.g. 101 means the associated chunk includes transactions having items A and C, but not B).

TABLE 6

ABC    Transaction count
111    14
110    12
101    10
100    6
011    4
010    5
001    5
000    24

Assuming two available slave threads, the content-based partition logic 102 groups the eight chunks into two groups, one for each slave thread. As described above, a heuristic search algorithm may be used to group the eight chunks into two groups, wherein each group contains about the same number of transactions. The two resulting groups may include the first group of chunks {111, 110, 101, 011} and the second group of chunks {100, 010, 001, 000}, wherein each group contains forty (40) transactions. FIG. 5 depicts the partitioned sub-branches and grouping result. The two sub-branches circled with dashed lines represent a first group of chunks assigned to the first thread. The two other sub-branches circled with solid lines represent the second group of chunks assigned to the second thread.
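The 40/40 balance can be checked directly from the chunk counts of Table 6. The Python sketch below takes the first group as stated and derives the second group as its complement; it also runs an exhaustive search over all 2^8 subsets, which is only a verification aid chosen for this illustration and not the heuristic search referred to above. The function name `best_two_way_split` is hypothetical.

```python
from itertools import combinations

# Chunk counts from Table 6, keyed by the A/B/C presence pattern.
CHUNK_COUNTS = {"111": 14, "110": 12, "101": 10, "100": 6,
                "011": 4, "010": 5, "001": 5, "000": 24}

first_group = {"111", "110", "101", "011"}
second_group = set(CHUNK_COUNTS) - first_group   # the remaining four chunks

first_total = sum(CHUNK_COUNTS[c] for c in first_group)
second_total = sum(CHUNK_COUNTS[c] for c in second_group)
print(first_total, second_total)  # 40 40


def best_two_way_split(counts):
    """Exhaustively find a subset whose total is closest to half of all transactions."""
    names = sorted(counts)
    total = sum(counts.values())
    best_diff, best_subset = total, set()
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            diff = abs(total - 2 * sum(counts[n] for n in subset))
            if diff < best_diff:
                best_diff, best_subset = diff, set(subset)
    return best_diff, best_subset


diff, subset = best_two_way_split(CHUNK_COUNTS)
print(diff)  # 0, so a perfectly balanced split exists
```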

The equal numbers of transactions processed by each thread result in an efficient balance of the load when building the FP-tree. Additionally, after the partitioning using the probe tree, only eight (8) (2³) locks are required for the possible shared nodes (e.g. nodes consisting only of A, B, and C in the FP-tree) in the parallel tree-building process, compared to 256 (2⁸) locks for a conventional build. Thus, the pre-partitioning results in a large reduction in the number of locks (e.g. power(2, items_num)−power(2, M)). Furthermore, the content-based partitioning may not require a duplication of nodes as in the conventional multiple tree approach. Accordingly, the content-based partitioning and grouping results in a good load balance among threads, while providing high scalability.
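As a worked instance of the lock reduction for this example, take the first exponent as the number of items in the database (eight) and the second as the number of items used for partitioning (three); reading the second exponent this way is an interpretation of the expression above:

```latex
\[
2^{8} - 2^{3} = 256 - 8 = 248
\]
```

That is, 248 of the 256 locks of the conventional build would no longer be needed under these assumptions.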

Table 7 below provides experimental results based on a comparison of the content-based tree partition and multiple tree approaches. The test platform was a 16-processor shared-memory system. The test benchmark database “accidents” may be downloaded directly from the link “http://fimi.cs.helsinki.fi/data/”, and the databases “bwebdocs” and “swebdocs” were cut from the database “webdocs.”

TABLE 7

Approach         Database    1P     2P     4P     8P     16P
Multiple tree    swebdocs    1102   613    359    197    125
Multiple tree    bwebdocs    278    155    92.7   52.8   42.9
Multiple tree    accidents   23.2   14.1   8.44   6.62   4.95
Tree partition   swebdocs    1031   539    330    187    120
Tree partition   bwebdocs    278    155    92.5   53.7   40.2
Tree partition   accidents   22.3   14.3   8.78   6.20   5.29

Table 7 shows the running time for the whole application, including the first scan, the tree construction procedure and the mining stages. Table 8 below shows the running times of the mining stage for each approach on all three benchmark databases.

TABLE 8

Approach         Database    1P     2P     4P     8P     16P
Multiple tree    swebdocs    1087   603    350    187    113
Multiple tree    bwebdocs    258    134    73.5   32.7   19.9
Multiple tree    accidents   20.3   11.8   6.44   4.35   2.95
Tree partition   swebdocs    1018   527    320    176    110
Tree partition   bwebdocs    250    129    70.6   31.9   17.8
Tree partition   accidents   18.0   10.5   5.63   2.66   1.44

As shown in Tables 7 and 8 above, the tree partition approach has performance comparable to the old multiple tree approach in total running time, and is a little better in just the mining stage. The results are reasonable: in the first and second stages the tree partition approach does additional partitioning for each transaction based on its content. However, the tree partition approach should be a little faster in the mining stage, since there are no duplicated tree nodes.
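The mining-stage comparison can also be expressed as a speedup, meaning the ratio of the 1-processor time to the 16-processor time. The Python sketch below is only an illustration: the running times are copied from Table 8, while the speedup metric and the name `speedup` are assumptions of this example rather than figures reported in the text.

```python
from typing import Dict, List

# Mining-stage running times copied from Table 8, for 1, 2, 4, 8, 16 processors.
MINING_TIMES: Dict[str, Dict[str, List[float]]] = {
    "multiple tree": {
        "swebdocs": [1087, 603, 350, 187, 113],
        "bwebdocs": [258, 134, 73.5, 32.7, 19.9],
        "accidents": [20.3, 11.8, 6.44, 4.35, 2.95],
    },
    "tree partition": {
        "swebdocs": [1018, 527, 320, 176, 110],
        "bwebdocs": [250, 129, 70.6, 31.9, 17.8],
        "accidents": [18.0, 10.5, 5.63, 2.66, 1.44],
    },
}


def speedup(times: List[float]) -> float:
    """Ratio of the 1-processor time to the 16-processor time."""
    return times[0] / times[-1]


for approach, per_db in MINING_TIMES.items():
    for db, times in per_db.items():
        print(f"{approach:14s} {db:10s} {speedup(times):5.2f}x")
```

Under these numbers, the 16-processor mining-stage speedup on “accidents” is about 12.5 for the tree partition approach versus about 6.9 for the multiple tree approach, while on “swebdocs” the two are close (about 9.3 versus 9.6).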

Aspects of the methods and systems described herein may be implemented as functionality programmed into any of a variety of circuitry, including programmable logic devices (“PLDs”), such as field programmable gate arrays (“FPGAs”), programmable array logic (“PAL”) devices, electrically programmable logic and memory devices and standard cell-based devices, as well as application specific integrated circuits. Implementations may also include microcontrollers with memory (such as EEPROM), embedded microprocessors, firmware, software, etc. Furthermore, aspects may be embodied in microprocessors having software-based circuit emulation, discrete logic (sequential and combinatorial), custom devices, fuzzy (neural) logic, quantum devices, and hybrids of any of the above device types. Of course the underlying device technologies may be provided in a variety of component types, e.g., metal-oxide semiconductor field-effect transistor (“MOSFET”) technologies like complementary metal-oxide semiconductor (“CMOS”), bipolar technologies like emitter-coupled logic (“ECL”), polymer technologies (e.g., silicon-conjugated polymer and metal-conjugated polymer-metal structures), mixed analog and digital, etc.

The term “processor” as generally used herein refers to any logic processing unit, such as one or more central processing units (“CPU”), digital signal processors (“DSP”), application-specific integrated circuits (“ASIC”), etc. While the term “component” is generally used herein, it is understood that “component” includes circuitry, components, modules, and/or any combination of circuitry, components, and/or modules as the terms are known in the art.

The various components and/or functions disclosed herein may be described using any number of combinations of hardware, firmware, and/or as data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, non-volatile storage media in various forms (e.g., optical, magnetic or semiconductor storage media) and carrier waves that may be used to transfer such formatted data and/or instructions through wireless, optical, or wired signaling media or any combination thereof. Examples of transfers of such formatted data and/or instructions by carrier waves include, but are not limited to, transfers (uploads, downloads, e-mail, etc.) over the Internet and/or other computer networks via one or more data transfer protocols.

Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of “including, but not limited to.” Words using the singular or plural number also include the plural or singular number respectively. Additionally, the words “herein,” “hereunder,” “above,” “below,” and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word “or” is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list; all of the items in the list; and any combination of the items in the list.

The above description of illustrated embodiments is not intended to be exhaustive or limited by the disclosure. While specific embodiments of, and examples for, the systems and methods are described herein for illustrative purposes, various equivalent modifications are possible, as those skilled in the relevant art will recognize. The teachings provided herein may be applied to other systems and methods, and not only for the systems and methods described above. The elements and acts of the various embodiments described above may be combined to provide further embodiments. These and other changes may be made to methods and systems in light of the above detailed description.

In general, in the following claims, the terms used should not be construed to be limited to the specific embodiments disclosed in the specification and the claims, but should be construed to include all systems and methods that operate under the claims. Accordingly, the method and systems are not limited by the disclosure, but instead the scope is to be determined entirely by the claims. While certain aspects are presented below in certain claim forms, the inventors contemplate the various aspects in any number of claim forms. Accordingly, the inventors reserve the right to add additional claims after filing the application to pursue such additional claim forms for other aspects as well.

1. A system for mining data comprising: a data store including data having a number of items; a mining application to mine data in the data store, the mining application including logic, the logic, when executed, is to: identify a number of frequent items of the data store; compute a probe structure based on the number of identified frequent items; and, partition the data according to content of the probe structure; wherein the mining application uses the probe structure to build a frequent pattern tree (FP-tree); and a memory for storing the probe structure and the FP-tree.
2. The system of claim 1, wherein the data of the data store includes a number of transactions, wherein each transaction comprises a unique sequence of items identified by the logic when identifying the frequent items of the data store.
3. The system of claim 2, wherein the logic is to partition the transactions according to content of the identified frequent items to obtain the probe structure, wherein the probe structure includes combinations of the identified frequent items and the number of occurrences of one or more content-based transactions.
4. The system of claim 3, wherein the logic orders the identified frequent items based on an occurrence frequency of each identified item in the data store.
5. The system of claim 3, further comprising a heuristic algorithm, wherein the heuristic algorithm is to group the one or more content-based transactions into approximately equal groups.
6. The system of claim 1, further comprising a master processor and one or more slave processors, wherein the master processor is to distribute a group of transactions to the one or more slave processors to build the FP-tree.
7. The system of claim 6, wherein the one or more slave processors build a part of the FP-tree based on the grouping of content-based transactions.
8. The system of claim 7, wherein the multiple processors mine the FP-tree to determine unique information about the items of the data store.
9. The system of claim 1, further comprising a multi-core system architecture.
10. A system for mining data, the system comprising: a database including a number of transactions; at least one processor to perform mining operations on the database, the at least one processor is to execute content-based partitioning logic on the transactions, wherein the content-based partitioning logic is to partition the transactions according to content based on a number of identified frequent items to obtain a probe structure; and a memory to store the probe structure.
11. The system of claim 10, the probe structure further comprising a probe tree and probe table, wherein the probe tree and probe table further comprise 2^(M) branches, wherein M corresponds to the number of identified frequent items.
12. The system of claim 11, wherein the memory further comprises shared memory to store the probe tree and probe table.
13. The system of claim 11 further comprising multiple processors to recursively mine the database, wherein each processor shares a substantially equal load based on a grouping and distribution of the 2^(M) branches.
14. The system of claim 13, the multiple processors further comprising a master processor and at least one slave processor to perform mining operations, wherein the master processor distributes operations to the at least one slave processor when building a frequent pattern tree (FP-tree) using the probe structure.
15. A method for mining data of a database, comprising: identifying frequent items of the database; building a probe structure based on the identified frequent items, wherein each branch of the probe structure includes a number of identified frequent items based on content; grouping the branches of the probe structure based on the content of each branch; and building a frequent pattern tree (FP-tree) from the probe structure.
16. The method of claim 15, further comprising scanning a first portion of the database when identifying frequent items of the database, and scanning a second portion of the database when building the probe structure, wherein the probe structure includes an associated number of counts with each branch of the probe structure after scanning the second portion of the database.
17. The method of claim 15, further comprising building the probe structure to include a probe tree and probe table, and using the probe tree and probe table to build the FP-tree for mining the FP-tree to determine frequent data patterns.
18. The method of claim 15, further comprising distributing each group of branches to an associated processor before building the FP-tree.
19. The method of claim 18, further comprising using a master processor to distribute each group of branches to one or more slave processors, and using the one or more slave processors to build the FP-tree.
20. The method of claim 15, further comprising partitioning the database according to content of the identified frequent items to obtain the probe structure, wherein the probe structure includes combinations of the identified frequent items and the number of occurrences of one or more content-based transactions.
21. A computer-readable medium having stored thereon instructions, which when executed in a system operate to manage data of a database by: identifying frequent items of the database; building a probe structure based on the identified frequent items, wherein each branch of the probe structure includes a number of identified frequent items assorted by content; grouping the branches of the probe structure based on the content of each branch; and building a frequent pattern tree (FP-tree) from the probe structure.
22. The computer-readable medium of claim 21, wherein the instructions, which when executed in a system operate to manage data of a database further by building the probe structure to include a probe tree and probe table, and using the probe tree and probe table to build the FP-tree for mining the FP-tree to determine frequent data patterns.
23. The computer-readable medium of claim 21, wherein the instructions, which when executed in a system operate to manage data of a database further by distributing each group of branches to an associated processor before building the FP-tree.